Abstract

The problem of making sequential decisions in unknown probabilistic environments is studied.
In cycle $t$ action $y_t$ results in perception $x_t$ and reward $r_t$, where all quantities in
general may depend on the complete history. The perception $x_t$ and reward $r_t$ are sampled
from the (reactive) environmental probability distribution $\mu$. This very general setting
includes, but is not limited to, (partially observable, k-th order) Markov decision processes.
Sequential decision theory tells us how to act in order to maximize the total expected reward,
called value, if $\mu$ is known. Reinforcement learning is usually used if $\mu$ is unknown. In
the Bayesian approach one defines a mixture distribution $\xi$ as a weighted sum of distributions
$\nu\in\mathcal{M}$, where $\mathcal{M}$ is any class of distributions including the true
environment $\mu$. We show that the Bayes-optimal policy $p^\xi$ based on the mixture $\xi$ is
self-optimizing in the sense that the average value converges asymptotically for all
$\mu\in\mathcal{M}$ to the optimal value achieved by the (infeasible) Bayes-optimal policy
$p^\mu$ which knows $\mu$ in advance. We show that the necessary condition that $\mathcal{M}$
admits self-optimizing policies at all, is also sufficient. No other structural assumptions are
made on $\mathcal{M}$. As an example application, we discuss ergodic Markov decision processes,
which allow for self-optimizing policies. Furthermore, we show that $p^\xi$ is Pareto-optimal in
the sense that there is no other policy yielding higher or equal value in all environments
$\nu\in\mathcal{M}$ and a strictly higher value in at least one.



*This work was supported by SNF grant 2000-61847.00 to Jürgen Schmidhuber.

1 Introduction



Reinforcement learning: There exists a well developed theory for reinforcement learning
agents in known probabilistic environments (like Blackjack) called sequential decision
theory [Bel57, Ber95]. The optimal agent is the one which maximizes the future expected
reward sum. This setup also includes deterministic environments (like static mazes).
Even adversarial environments (like Chess or Backgammon) may be seen as special cases
in some sense [Hut00, ch.6] (the reverse is also true [BT00]). Sequential decision theory
deals with a wide range of problems, and provides a general formal solution in the sense
that it is mathematically rigorous and (uniquely) specifies the optimal solution (leaving
aside computational issues). The theory breaks down when the environment is unknown
(like when driving a car in the real world). Reinforcement learning algorithms exist for
unknown Markov decision processes (mdps) with small state space, and for other restricted
classes [KLM96, SB98, Ber95, KV86], but even in these cases their learning rate is usually
far from optimum.



Performance measures: In this work we are interested in general (probabilistic)
environmental classes $\mathcal{M}$. We assume $\mathcal{M}$ is given, and that the true
environment $\mu$ is in $\mathcal{M}$, but is otherwise unknown. The expected reward sum
(value) $V^p_\mu$ when following policy $p$ is of central interest. We are interested in
policies $\tilde p$ which perform well (have high value) independent of what the true
environment $\mu\in\mathcal{M}$ is. A natural demand from an optimal policy is that there is no
other policy yielding higher or equal value in all environments $\nu\in\mathcal{M}$ and a
strictly higher value in one $\nu\in\mathcal{M}$. We call such a property Pareto-optimality.
The other quantity of interest is how close $V^{\tilde p}_\mu$ is to the value $V^*_\mu$ of
the optimal (but infeasible) policy $p^\mu$ which knows $\mu$ in advance. We call a policy
whose average value converges asymptotically for all $\mu\in\mathcal{M}$ to the optimal value
$V^*_\mu$ if $\mu$ is the true environment, self-optimizing.



Main new results for Bayes-mixtures: We define the Bayes-mixture $\xi$ as a weighted
average of the environments $\nu\in\mathcal{M}$ and analyze the properties of the Bayes-optimal
policy $p^\xi$ which maximizes the mixture value $V^p_\xi$. One can show that not all
environmental classes $\mathcal{M}$ admit self-optimizing policies. One way to proceed is to
search for and prove weaker properties than self-optimizingness [Hut00]. Here we follow a
different approach: Obviously, the least we must demand from $\mathcal{M}$ to have a chance of
finding a self-optimizing policy is that there exists some self-optimizing policy $\tilde p$ at
all. The main new result of this work is that this necessary condition is also sufficient for
$p^\xi$ to be self-optimizing. No other properties need to be imposed on $\mathcal{M}$. The
other new result is that $p^\xi$ is always Pareto-optimal, with no conditions at all imposed on
$\mathcal{M}$.



Contents: Section 2 defines the model of agents acting in general probabilistic
environments and defines the finite horizon value of a policy and the optimal value-maximizing
policy. Furthermore, the mixture-distribution is introduced and the fundamental linearity
and convexity properties of the mixture-values are stated. Section 3 defines and proves
Pareto-optimality of $p^\xi$. The concept is refined to balanced Pareto-optimality, showing
that a small increase of the value for some environments only leaves room for a small
decrease in others. Section 4 shows that $p^\xi$ is self-optimizing if $\mathcal{M}$ admits
self-optimizing policies, and also gives the speed of convergence in the case of finite
$\mathcal{M}$. The finite horizon model has several disadvantages. For this reason Section 5
defines the discounted (infinite horizon) future value function, and the corresponding optimal
value-maximizing policy. Pareto-optimality and self-optimizingness of $p^\xi$ are shown for
this model. As an application we show in Section 6 that the class of ergodic mdps admits
self-optimizing policies w.r.t. the undiscounted model and w.r.t. the discounted model if the
effective horizon tends to infinity. Together with the results from the previous sections this
shows that $p^\xi$ is self-optimizing for ergodic mdps. Conclusions and outlook can be found in
Section 7.



2 Rational Agents in Probabilistic Environments 

The agent model: A very general framework for intelligent systems is that of rational
agents [RN95]. In cycle $k$, an agent performs action $y_k\in\mathcal{Y}$ (output) which
results in a perception or observation $x_k\in\mathcal{X}$ (input), followed by cycle $k+1$
and so on. We assume that the action and perception spaces $\mathcal{X}$ and $\mathcal{Y}$ are
finite. We write $p(x_{<k})=y_{1:k}$ to denote the output $y_{1:k}\equiv y_1...y_k$ of the
agent's policy $p$ on input $x_{<k}\equiv x_1...x_{k-1}$ and similarly $q(y_{1:k})=x_{1:k}$ for
the environment $q$ in the case of deterministic environments. We call policy $p$ and
environment $q$ behaving in this way chronological. Note that policy and environment are
allowed to depend on the complete history. We do not make any mdp or pomdp assumption here, and
we don't talk about states of the environment, only about observations. In the more general
case of a probabilistic environment, given the history
$yx_{<k}y_k \equiv yx_{1:k-1}y_k \equiv y_1x_1...y_{k-1}x_{k-1}y_k$, the probability that the
environment leads to perception $x_k$ in cycle $k$ is (by definition)
$\rho(yx_{<k}y\underline{x}_k)$. The underlined argument $\underline{x}_k$ in $\rho$ is a
random variable and the other non-underlined arguments $yx_{<k}y_k$ represent conditions.^1
We call probability distributions like $\rho$ chronological. Since value optimizing policies
can always be chosen deterministic, there is no real need to generalize the setting to
probabilistic policies. Arbitrarily we formalize Sections 3 and 4 in terms of deterministic
policies and Section 5 in terms of probabilistic policies.



Value functions and optimal policies: The goal of the agent is to maximize future
rewards, which are provided by the environment through the inputs $x_k$. The inputs
$x_k\equiv x'_k r_k$ are divided into a regular part $x'_k$ and some (possibly empty or
delayed) reward $r_k\in[0,r_{max}]$.^2 We use the abbreviation

$$\rho(yx_{<k}y\underline{x}_{k:m}) = \rho(yx_{<k}y\underline{x}_k)\cdot\rho(yx_{1:k}y\underline{x}_{k+1})\cdot\ldots\cdot\rho(yx_{<m}y\underline{x}_m), \qquad (1)$$

which is essentially Bayes' rule, and $\epsilon = yx_{<1}$ for the empty string. The
$\rho$-expected reward sum (value) of future cycles $k$ to $m$ with outputs $y_{k:m}$ generated
by the agent's policy $p$, the optimal policy $p^\rho$ which maximizes the value, its action
$y_k$ and the corresponding value can formally be defined as follows.

1 The standard notation $\rho(x_k|yx_{<k}y_k)$ for conditional probabilities destroys the
chronological order and would become quite confusing in later expressions.

2 In the reinforcement learning literature when dealing with (po)mdps the reward is usually
considered to be a function of the environmental state. The zero-assumption analogue here is
that the reward $r_k$ is some probabilistic function $\rho'$ depending on the complete history.
It is very convenient to integrate $r_k$ into $x_k$ and $\rho'$ into $\rho$.



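Definition 1 (Value function and optimal policy) We define the value of policy $p$
in environment $\rho$ given history $yx_{<k}$, or shorter, the $\rho$-value of $p$ given
$yx_{<k}$, as

$$V^{p\rho}_{km}(yx_{<k}) := \sum_{x_{k:m}} (r_k+\ldots+r_m)\,\rho(yx_{<k}y\underline{x}_{k:m})\Big|_{y_{1:m}=p(x_{<m})}. \qquad (2)$$

$m$ is the lifespan or initial horizon of the agent. The $\rho$-optimal policy $p^\rho$ which
maximizes the (total) value $V^p_\rho := V^{p\rho}_{1m}(\epsilon)$ is

$$p^\rho := \arg\max_p V^p_\rho, \qquad V^{*\rho}_{km}(yx_{<k}) := V^{p^\rho\rho}_{km}(yx_{<k}). \qquad (3)$$

Explicit expressions for the action $y_k$ in cycle $k$ of the $\rho$-optimal policy $p^\rho$
and their value $V^{*\rho}_{km}(yx_{<k})$ are

$$y_k = \arg\max_{y_k}\sum_{x_k}\max_{y_{k+1}}\sum_{x_{k+1}}\ldots\max_{y_m}\sum_{x_m}(r_k+\ldots+r_m)\cdot\rho(yx_{<k}y\underline{x}_{k:m}), \qquad (4)$$

$$V^{*\rho}_{km}(yx_{<k}) = \max_{y_k}\sum_{x_k}\max_{y_{k+1}}\sum_{x_{k+1}}\ldots\max_{y_m}\sum_{x_m}(r_k+\ldots+r_m)\cdot\rho(yx_{<k}y\underline{x}_{k:m}), \qquad (5)$$

where $yx_{<k}$ is the actual history.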



One can show [Hut00] that these definitions are consistent and correctly capture our
intention. For instance, consider the expectimax expression (5): The best expected reward
is obtained by averaging over possible perceptions $x_i$ and by maximizing over the possible
actions $y_i$. This has to be done in chronological order $y_kx_k...y_mx_m$ to correctly
incorporate the dependency of $x_i$ and $y_i$ on the history. Obviously

$$V^{*\rho}_{km}(yx_{<k}) \geq V^{p\rho}_{km}(yx_{<k}) \;\;\forall p, \quad\text{especially}\quad V^*_\rho \geq V^p_\rho \;\;\forall p. \qquad (6)$$



Known environment $\mu$: Let us now make a change in conventions and assume that $\mu$ is
the true environment in which the agent operates and that we know $\mu$ (like in
Blackjack).^3 Then, policy $p^\mu$ is optimal in the sense that no other policy for an agent
leads to higher $\mu$-expected reward. This setting includes as special cases deterministic
environments, Markov decision processes (mdps), and even adversarial environments for special
choices of $\mu$ [Hut00]. There is no problem in principle in determining the optimal action
$y_k$ as long as $\mu$ is known and computable and $\mathcal{X}$, $\mathcal{Y}$ and $m$ are
finite.
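For a known environment with finite action space, perception space and horizon, the expectimax expression (5) can be evaluated by a direct recursion over cycles. The following is a minimal sketch under stated assumptions, not an implementation from this work: the representation of an environment as a function `env(history, action)` returning a dictionary of perception probabilities, and the toy environment `switch_env` with its numbers, are hypothetical and chosen only for illustration. The cost grows exponentially in the horizon, so this is only usable for tiny examples.

```python
from typing import Callable, Dict, Sequence, Tuple

# A history is a tuple of (action, observation, reward) triples, one per past cycle.
History = Tuple[Tuple[int, int, float], ...]
# env(history, action) -> {(observation, reward): probability}; assumed encoding of
# the chronological distribution of the next perception given history and action.
Env = Callable[[History, int], Dict[Tuple[int, float], float]]


def action_value(env: Env, actions: Sequence[int], history: History,
                 y: int, k: int, m: int) -> float:
    """Expected reward sum of cycles k..m when taking y now and acting optimally afterwards."""
    total = 0.0
    for (x, r), prob in env(history, y).items():
        future = optimal_value(env, actions, history + ((y, x, r),), k + 1, m)
        total += prob * (r + future)
    return total


def optimal_value(env: Env, actions: Sequence[int], history: History,
                  k: int, m: int) -> float:
    """Recursive form of the expectimax value: maximize over actions, average over perceptions."""
    if k > m:
        return 0.0
    return max(action_value(env, actions, history, y, k, m) for y in actions)


def optimal_action(env: Env, actions: Sequence[int], history: History,
                   k: int, m: int) -> int:
    """Action in cycle k that attains the maximum in the expectimax expression."""
    return max(actions, key=lambda y: action_value(env, actions, history, y, k, m))


def switch_env(history: History, y: int) -> Dict[Tuple[int, float], float]:
    """Hypothetical history-dependent toy environment.

    Switching away from the previous action pays reward 1 with probability 0.8
    (else 0); repeating the action, or acting in the first cycle, pays 0.5 surely.
    """
    if history and history[-1][0] != y:
        return {(1, 1.0): 0.8, (0, 0.0): 0.2}
    return {(0, 0.5): 1.0}
```

With horizon 3 and an empty history, `optimal_value(switch_env, (0, 1), (), 1, 3)` evaluates the total optimal value, and `optimal_action` returns the maximizing action for any given history. Because the toy environment depends on the previous action, the example exercises the dependence on history rather than on a Markov state.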



3 If the existence of true objective probabilities violates the philosophical attitude of the
reader, he may assume a deterministic environment $\mu$.

The mixture distribution $\xi$: Things drastically change if $\mu$ is unknown. For
(parameterized) mdps with small state (parameter) space, suboptimal reinforcement learning
algorithms may be used to learn the unknown $\mu$ [KLM96, SB98, Ber95, KV86]. In the
Bayesian approach the true probability distribution $\mu$ is not learned directly, but is
replaced by a Bayes-mixture $\xi$. Let us assume that we know that the true environment $\mu$
is contained in some known set $\mathcal{M}$ of environments. For convenience we assume that
$\mathcal{M}$ is finite or countable. The Bayes-mixture $\xi$ is defined as

$$\xi(y\underline{x}_{1:m}) = \sum_{\nu\in\mathcal{M}} w_\nu\,\nu(y\underline{x}_{1:m}) \quad\text{with}\quad \sum_{\nu\in\mathcal{M}} w_\nu = 1, \quad w_\nu > 0 \;\;\forall\nu\in\mathcal{M}. \qquad (7)$$

The weights $w_\nu$ may be interpreted as the prior degree of belief that the true environment
is $\nu$. Then $\xi(y\underline{x}_{1:m})$ could be interpreted as the prior subjective belief
probability in observing $x_{1:m}$, given actions $y_{1:m}$. It is, hence, natural to follow
the policy $p^\xi$ which maximizes $V^p_\xi$. If $\mu$ is the true environment the expected
reward when following policy $p^\xi$ will be $V^{p^\xi}_\mu$. The optimal (but infeasible)
policy $p^\mu$ yields reward $V^{p^\mu}_\mu \equiv V^*_\mu$. It is now of interest (a) whether
there are policies with uniformly larger value than $V^{p^\xi}_\mu$ and (b) how close
$V^{p^\xi}_\mu$ is to $V^*_\mu$. These are the main issues of the remainder of this work.

A universal choice of $\xi$ and $\mathcal{M}$: One may also ask what the most general class
$\mathcal{M}$ and weights $w_\nu$ could be. Without any prior knowledge we should include all
environments in $\mathcal{M}$. In this generality this approach leads at best to negative
results. More useful is the assumption that the environment possesses some structure, we just
don't know which. From a computational point of view we can only unravel effective structures
which are describable by (semi)computable probability distributions. So we may include all
(semi)computable (semi)distributions in $\mathcal{M}$. Occam's razor tells us to assign high
prior belief to simple environments. Using Kolmogorov's universal complexity measure $K(\nu)$
for environments $\nu$ one should set $w_\nu \sim 2^{-K(\nu)}$, where $K(\nu)$ is the length
of the shortest program on a universal Turing machine computing $\nu$. The resulting policy
$p^\xi$ has been developed and intensively discussed in [Hut00]. It is a unification of
sequential decision theory [Bel57, Ber95] and Solomonoff's celebrated universal induction
scheme [Sol78, LV97]. In the following we consider generic $\mathcal{M}$ and $w_\nu$. The
following property of $V_\rho$ is crucial.

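Theorem 1 (Linearity and convexity of $V_\rho$ in $\rho$) $V^p_\rho$ is a linear function in
$\rho$ and $V^*_\rho$ is a convex function in $\rho$ in the sense that

$$V^p_\xi = \sum_{\nu\in\mathcal{M}} w_\nu V^p_\nu \quad\text{and}\quad V^*_\xi \leq \sum_{\nu\in\mathcal{M}} w_\nu V^*_\nu \quad\text{where}\quad \xi(y\underline{x}_{1:m}) = \sum_{\nu\in\mathcal{M}} w_\nu\,\nu(y\underline{x}_{1:m}).$$

Proof: Linearity is obvious from the definition of $V^p_\rho$. Convexity follows from
$V^*_\xi \equiv V^{p^\xi}_\xi = \sum_\nu w_\nu V^{p^\xi}_\nu \leq \sum_\nu w_\nu V^*_\nu$,
where the identity is definition (3), the equality uses linearity of $V^p_\rho$ just proven,
and the last inequality follows from the dominance (6) and non-negativity of the weights
$w_\nu$. □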
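The linearity and convexity statements can be checked numerically on a tiny example. The sketch below is only an illustration under simplifying assumptions: the horizon is a single cycle ($m=1$), so a deterministic policy is just one action, and the two environments `NU1`, `NU2` with their reward probabilities are hypothetical numbers, not taken from this work. The mixture is formed on the level of probabilities as in (7), and the values are then computed from the mixed distribution.

```python
from typing import Dict, List

# One-cycle environment: action -> {reward: probability}.
OneStepEnv = Dict[int, Dict[float, float]]

# Hypothetical example environments and prior weights.
NU1: OneStepEnv = {0: {1.0: 0.9, 0.0: 0.1}, 1: {1.0: 0.2, 0.0: 0.8}}
NU2: OneStepEnv = {0: {1.0: 0.1, 0.0: 0.9}, 1: {1.0: 0.7, 0.0: 0.3}}
WEIGHTS: List[float] = [0.5, 0.5]


def value(env: OneStepEnv, action: int) -> float:
    """Expected reward of a one-cycle policy that takes `action`."""
    return sum(r * p for r, p in env[action].items())


def optimal_value(env: OneStepEnv) -> float:
    """Value of the best action in `env`."""
    return max(value(env, a) for a in env)


def mixture(envs: List[OneStepEnv], weights: List[float]) -> OneStepEnv:
    """Bayes-mixture: weighted sum of the environments' reward probabilities."""
    xi: OneStepEnv = {}
    for a in envs[0]:
        xi[a] = {}
        for env, w in zip(envs, weights):
            for r, p in env[a].items():
                xi[a][r] = xi[a].get(r, 0.0) + w * p
    return xi


XI = mixture([NU1, NU2], WEIGHTS)
# Linearity: the mixture value of each policy equals the weighted sum of its values.
linear_gap = max(
    abs(value(XI, a) - sum(w * value(nu, a) for w, nu in zip(WEIGHTS, [NU1, NU2])))
    for a in XI
)
# Convexity: optimal mixture value versus weighted sum of optimal values.
lhs = optimal_value(XI)
rhs = sum(w * optimal_value(nu) for w, nu in zip(WEIGHTS, [NU1, NU2]))
```

In this example `lhs` is strictly smaller than `rhs`: a single action must serve both environments, whereas the right-hand side lets each environment have its own best action.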

One loose interpretation of the convexity is that a mixture can never increase performance.
In the remainder of this work $\mu$ denotes the true environment, $\rho$ any distribution, and
$\xi$ the Bayes-mixture of distributions $\nu\in\mathcal{M}$.

3 Pareto Optimality of policy $p^\xi$

The total $\mu$-expected reward $V^{p^\xi}_\mu$ of policy $p^\xi$ is of central interest in
judging the performance of policy $p^\xi$. We know that there are policies (e.g. $p^\mu$) with
higher $\mu$-value ($V^*_\mu \geq V^{p^\xi}_\mu$). In general, every policy based on an
estimate $\rho$ of $\mu$ which is closer to $\mu$ than $\xi$ is, outperforms $p^\xi$ in
environment $\mu$, simply because it is more tailored toward $\mu$. On the other hand, such a
system probably performs worse than $p^\xi$ in other environments. Since we do not know $\mu$
in advance we may ask whether there exists a policy $p$ with better or equal performance than
$p^\xi$ in all environments $\nu\in\mathcal{M}$ and a strictly better performance for one
$\nu\in\mathcal{M}$. This would clearly render $p^\xi$ suboptimal. We show that there is no
such $p$.

Theorem 2 (Pareto optimality) Policy $p^\xi$ is Pareto-optimal in the sense that there is
no other policy $p$ with $V^p_\nu \geq V^{p^\xi}_\nu$ for all $\nu\in\mathcal{M}$ and strict
inequality for at least one $\nu$.

Proof: We want to arrive at a contradiction by assuming that $p^\xi$ is not Pareto-optimal,
i.e. by assuming the existence of a policy $p$ with $V^p_\nu \geq V^{p^\xi}_\nu$ for all
$\nu\in\mathcal{M}$ and strict inequality for at least one $\nu$:

$$V^p_\xi = \sum_\nu w_\nu V^p_\nu > \sum_\nu w_\nu V^{p^\xi}_\nu = V^{p^\xi}_\xi \equiv V^*_\xi \geq V^p_\xi$$

The two equalities follow from linearity of $V_\rho$ (Theorem 1). The strict inequality follows
from the assumption and from $w_\nu > 0$. The identity is just Definition 1 (3). The last
inequality follows from the fact that $p^\xi$ maximizes by definition the universal value (6).
The contradiction $V^p_\xi > V^p_\xi$ proves Pareto-optimality of policy $p^\xi$. □

Pareto-optimality should be regarded as a necessary condition for an agent aiming to be
optimal. From a practical point of view a significant increase of $V$ for many environments
$\nu$ may be desirable even if this causes a small decrease of $V$ for a few other $\nu$. The
impossibility of such a "balanced" improvement is a more demanding condition on $p^\xi$
than pure Pareto-optimality. The next theorem shows that $p^\xi$ is also
balanced-Pareto-optimal in the following sense:

Theorem 3 (Balanced Pareto optimality)

$$\Delta_\nu := V^{p^\xi}_\nu - V^{\tilde p}_\nu, \quad \Delta := \sum_{\nu\in\mathcal{M}} w_\nu\Delta_\nu \quad\Rightarrow\quad \Delta \geq 0.$$

This implies the following: Assume $\tilde p$ has lower value than $p^\xi$ on environments
$\mathcal{L}$ by a total weighted amount of
$\Delta_\mathcal{L} := \sum_{\lambda\in\mathcal{L}} w_\lambda\Delta_\lambda$. Then $\tilde p$
can have higher value on $\eta\in\mathcal{H} := \mathcal{M}\setminus\mathcal{L}$, but the
improvement is bounded by
$\Delta_\mathcal{H} := |\sum_{\eta\in\mathcal{H}} w_\eta\Delta_\eta| \leq \Delta_\mathcal{L}$.
Especially $|\Delta_\eta| \leq w_\eta^{-1}\max_{\lambda\in\mathcal{L}}\Delta_\lambda$.

This means that a weighted value increase $\Delta_\mathcal{H}$ by using $\tilde p$ instead of
$p^\xi$ is compensated by an at least as large weighted decrease $\Delta_\mathcal{L}$ on other
environments. If the decrease is small, the increase can also only be small. In the special
case of only a single environment with decreased value $\Delta_\lambda$, the increase is
bounded by $|\Delta_\eta| \leq \frac{w_\lambda}{w_\eta}|\Delta_\lambda|$, i.e. a decrease by an
amount $\Delta_\lambda$ can only cause an increase by at most the same amount times a factor
$\frac{w_\lambda}{w_\eta}$. For the choice of the weights $w_\nu \sim 2^{-K(\nu)}$, a decrease
can only cause a smaller increase in simpler environments, but a scaled increase in more
complex environments. Finally note that pure Pareto-optimality (Theorem 2) follows from
balanced Pareto-optimality in the special case of no decrease $\Delta_\mathcal{L} \equiv 0$.

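Proof: $\Delta \geq 0$ follows from
$\Delta = \sum_\nu w_\nu[V^{p^\xi}_\nu - V^{\tilde p}_\nu] = V^{p^\xi}_\xi - V^{\tilde p}_\xi \geq 0$,
where we have used linearity of $V_\rho$ (Theorem 1) and dominance
$V^{p^\xi}_\xi \geq V^p_\xi$ (6). The remainder of Theorem 3 is obvious from
$0 \leq \Delta = \Delta_\mathcal{L} - \Delta_\mathcal{H}$ and by bounding the weighted average
$\Delta_\eta$ by its maximum. □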
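Both Pareto statements can be illustrated on a small table of values. The sketch below assumes that the values of every policy in every environment are already given as a table `V_TABLE[nu][p]`; the numbers and the prior `WEIGHTS` are hypothetical. By linearity the mixture value of a policy is the weighted column sum, the Bayes-optimal policy is the column maximizing it, and the code then reports whether any other policy dominates it and how weighted losses and gains compare.

```python
from typing import List, Tuple

# Hypothetical values: rows are environments nu, columns are policies p.
V_TABLE: List[List[float]] = [
    [0.9, 0.2, 0.6, 0.5],
    [0.1, 0.8, 0.6, 0.5],
    [0.3, 0.4, 0.5, 0.8],
]
WEIGHTS: List[float] = [0.5, 0.3, 0.2]  # prior weights, positive and summing to 1


def mixture_value(V: List[List[float]], w: List[float], p: int) -> float:
    """Mixture value of policy p, using linearity of the value in the environment."""
    return sum(w_nu * V_nu[p] for w_nu, V_nu in zip(w, V))


def bayes_policy(V: List[List[float]], w: List[float]) -> int:
    """Index of the policy maximizing the mixture value."""
    return max(range(len(V[0])), key=lambda p: mixture_value(V, w, p))


def dominates(V: List[List[float]], p: int, q: int) -> bool:
    """True if p is at least as good as q everywhere and strictly better somewhere."""
    at_least = all(V_nu[p] >= V_nu[q] for V_nu in V)
    strictly = any(V_nu[p] > V_nu[q] for V_nu in V)
    return at_least and strictly


def balance(V: List[List[float]], w: List[float], p_xi: int, p: int) -> Tuple[float, float]:
    """Weighted loss (where p is worse than p_xi) and weighted gain (where p is better)."""
    deltas = [V_nu[p_xi] - V_nu[p] for V_nu in V]
    loss = sum(w_nu * d for w_nu, d in zip(w, deltas) if d > 0)
    gain = -sum(w_nu * d for w_nu, d in zip(w, deltas) if d < 0)
    return loss, gain
```

For the table above `bayes_policy(V_TABLE, WEIGHTS)` selects the third column, and for every alternative column the weighted gain returned by `balance` does not exceed the weighted loss, in line with the balanced Pareto bound.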



4 Self-optimizing Policy w.r.t. Average Value 

In the following we study under which circumstances^4

$$\tfrac{1}{m}V^{p^\xi_m\nu}_{1m} \to \tfrac{1}{m}V^{*\nu}_{1m} \quad\text{for}\quad m\to\infty \quad\text{for all}\quad \nu\in\mathcal{M}. \qquad (8)$$

The least we must demand from $\mathcal{M}$ to have a chance that (8) is true is that there
exists some policy $\tilde p$ at all with this property, i.e.

$$\exists\tilde p: \tfrac{1}{m}V^{\tilde p\nu}_{1m} \to \tfrac{1}{m}V^{*\nu}_{1m} \quad\text{for}\quad m\to\infty \quad\text{for all}\quad \nu\in\mathcal{M}. \qquad (9)$$

Luckily, this necessary condition will also be sufficient. This is another (asymptotic)
optimality property of policy $p^\xi$. If universal convergence in the sense of (9) is possible
at all in a class of environments $\mathcal{M}$, then policy $p^\xi$ converges in the sense of
(8). We will call policies $\tilde p$ with a property like (9) self-optimizing [KV86]. The
following two Lemmas pave the way for proving the convergence Theorem.

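Lemma 1 (Value difference relation)

$$0 \leq V^*_\nu - V^{\tilde p}_\nu =: \Delta_\nu \quad\Rightarrow\quad 0 \leq V^*_\mu - V^{p^\xi}_\mu \leq \tfrac{1}{w_\mu}\Delta \quad\text{with}\quad \Delta := \sum_{\nu\in\mathcal{M}} w_\nu\Delta_\nu$$

Proof: The following sequence of inequalities proves the lemma:

$$0 \leq w_\mu[V^*_\mu - V^{p^\xi}_\mu] \leq \sum_\nu w_\nu[V^*_\nu - V^{p^\xi}_\nu] \leq \sum_\nu w_\nu[V^*_\nu - V^{\tilde p}_\nu] = \sum_\nu w_\nu\Delta_\nu \equiv \Delta$$

In the first and second inequality we used $w_\nu \geq 0$ and $V^*_\nu - V^{p^\xi}_\nu \geq 0$.
The last inequality follows from
$\sum_\nu w_\nu V^{p^\xi}_\nu = V^{p^\xi}_\xi \equiv V^*_\xi \geq V^{\tilde p}_\xi = \sum_\nu w_\nu V^{\tilde p}_\nu$. □

We also need some results for averages of functions $\delta_\nu(m) \geq 0$ converging to zero.

4 Here and elsewhere we interpret $a_m \to b_m$ as an abbreviation for $a_m - b_m \to 0$.
$\lim_{m\to\infty} b_m$ may not exist.

Lemma 2 (Convergence of averages) For $\delta(m) := \sum_{\nu\in\mathcal{M}} w_\nu\delta_\nu(m)$
the following holds (we only need $\sum_\nu w_\nu \leq 1$):

i) $\delta_\nu(m) \leq f(m)$ $\forall\nu$ implies $\delta(m) \leq f(m)$.
ii) $\delta_\nu(m) \to 0$ for $m\to\infty$ $\forall\nu$ implies $\delta(m) \to 0$ if $0 \leq \delta_\nu(m) \leq c$.

Proof: (i) immediately follows from
$\delta(m) = \sum_\nu w_\nu\delta_\nu(m) \leq \sum_\nu w_\nu f(m) \leq f(m)$. For (ii) we
choose some order on $\mathcal{M}$ and some $\nu_0\in\mathcal{M}$ large enough such that
$\sum_{\nu\geq\nu_0} w_\nu \leq \frac{\varepsilon}{c}$. Using $\delta_\nu(m) \leq c$ this implies

$$\sum_{\nu\geq\nu_0} w_\nu\delta_\nu(m) \leq \sum_{\nu\geq\nu_0} w_\nu c \leq \varepsilon.$$

Furthermore, the assumption $\delta_\nu(m) \to 0$ means that there is an $m_{\nu\varepsilon}$
depending on $\nu$ and $\varepsilon$ such that $\delta_\nu(m) \leq \varepsilon$ for all
$m \geq m_{\nu\varepsilon}$. This implies

$$\sum_{\nu<\nu_0} w_\nu\delta_\nu(m) \leq \sum_{\nu<\nu_0} w_\nu\varepsilon \leq \varepsilon \quad\text{for all}\quad m \geq \max_{\nu<\nu_0}\{m_{\nu\varepsilon}\} =: m_\varepsilon.$$

$m_\varepsilon < \infty$, since the maximum is over a finite set. Together we have

$$\delta(m) \equiv \sum_\nu w_\nu\delta_\nu(m) \leq 2\varepsilon \quad\text{for}\quad m \geq m_\varepsilon \quad\Rightarrow\quad \delta(m) \to 0 \quad\text{for}\quad m\to\infty$$

since $\varepsilon$ was arbitrary and $\delta(m) \geq 0$. □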

Theorem 4 (Self-optimizing policy $p^\xi$ w.r.t. average value) If there exists a sequence
of policies $\tilde p_m$, $m=1,2,3,...$ with value within $\Delta(m)$ to optimum for all
environments $\nu\in\mathcal{M}$, then, save for a constant factor, this also holds for the
sequence of universal policies $p^\xi_m$, i.e.

i) If $\exists\tilde p_m \forall\nu: V^{*\nu}_{1m} - V^{\tilde p_m\nu}_{1m} \leq \Delta(m)$ $\Rightarrow$ $V^{*\mu}_{1m} - V^{p^\xi_m\mu}_{1m} \leq \frac{1}{w_\mu}\Delta(m)$.

If there exists a sequence of self-optimizing policies $\tilde p_m$ in the sense that their
expected average reward $\frac{1}{m}V^{\tilde p_m\nu}_{1m}$ converges to the optimal average
$\frac{1}{m}V^{*\nu}_{1m}$ for all environments $\nu\in\mathcal{M}$, then this also holds for
the sequence of universal policies $p^\xi_m$, i.e.

ii) If $\exists\tilde p_m \forall\nu: \frac{1}{m}V^{\tilde p_m\nu}_{1m} \to \frac{1}{m}V^{*\nu}_{1m}$ $\Rightarrow$ $\frac{1}{m}V^{p^\xi_m\mu}_{1m} \to \frac{1}{m}V^{*\mu}_{1m}$ for $m\to\infty$.

The beauty of this theorem is that if universal convergence in the sense of (9) is possible
at all in a class of environments $\mathcal{M}$, then policy $p^\xi$ converges (in the sense of
(8)). The necessary condition of convergence is also sufficient. The unattractive point is that
this is not an asymptotic convergence statement for $V^{p^\xi\mu}_{km}$ of a single policy
$p^\xi$ for $k\to\infty$ for some fixed $m$, and in fact no such theorem could be true, since
always $k \leq m$. The theorem merely says that under the stated conditions the average value
of $p^\xi_m$ can be arbitrarily close to optimum for sufficiently large (pre-chosen) horizon
$m$. This weakness will be resolved in the next section.

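Proof: (i) $\Delta_\nu(m) \leq f(m)$ implies $\Delta(m) \leq f(m)$ by Lemma 2(i). Inserting
this in Lemma 1 proves Theorem 4(i) (recovering the $m$ dependence and finally renaming
$f \to \Delta$).

(ii) We define
$\delta_\nu(m) := \frac{1}{m}\Delta_\nu(m) = \frac{1}{m}[V^{*\nu}_{1m} - V^{\tilde p_m\nu}_{1m}]$.
Since we assumed bounded rewards $0 \leq r \leq r_{max}$ we have

$$V^{*\nu}_{1m} \leq m\,r_{max} \quad\text{and}\quad V^{\tilde p_m\nu}_{1m} \geq 0 \quad\Rightarrow\quad \Delta_\nu \leq m\,r_{max} \quad\Rightarrow\quad 0 \leq \delta_\nu(m) \leq c := r_{max}.$$

The premise in Theorem 4(ii) is that
$\delta_\nu(m) = \frac{1}{m}[V^{*\nu}_{1m} - V^{\tilde p_m\nu}_{1m}] \to 0$ which implies

$$0 \leq \tfrac{1}{m}[V^{*\mu}_{1m} - V^{p^\xi_m\mu}_{1m}] \leq \tfrac{1}{w_\mu}\tfrac{1}{m}\Delta(m) = \tfrac{1}{w_\mu}\delta(m) \to 0.$$

The inequalities follow from Lemma 1 and convergence to zero from Lemma 2(ii). This
proves Theorem 4(ii). □

In Section 6 we show that a converging $\tilde p$ exists for ergodic mdps, and hence $p^\xi$
converges in this environmental class too (in the sense of Theorem 4).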

5 Discounted Future Value Function 

We now shift our focus from the total value $V_{1m}$, $m\to\infty$ to the future value
(value-to-go) $V_{k?}$, $k\to\infty$. The main reason is that we want to get rid of the horizon
parameter $m$. In the last section we have shown a convergence theorem for $m\to\infty$, but a
specific policy $p^\xi$ is defined for all times relative to a fixed horizon $m$. Current time
$k$ is moving, but $m$ is fixed.^5 Actually, to use $k\to\infty$ arguments we have to get rid
of $m$, since $k \leq m$. This is the reason for the question mark in $V_{k?}$ above.

We eliminate the horizon by discounting the rewards $r_k \rightsquigarrow \gamma_k r_k$ with
$\sum_{i=1}^\infty \gamma_i < \infty$ and letting $m\to\infty$. The analogue of $m$ is now an
effective horizon $h^{eff}_k$ which may be defined by
$\sum_{i=k}^{k+h^{eff}_k} \gamma_i \sim \sum_{i=k+h^{eff}_k}^{\infty} \gamma_i$. See
[Hut00, Ch.4] for a detailed discussion of the horizon problem. Furthermore, we renormalize
$V_{k\infty}$ by $\sum_{i=k}^\infty \gamma_i$ and denote it by $V_{k\gamma}$. It can be
interpreted as a future expected weighted-average reward. Furthermore we extend the
definition to probabilistic policies $\pi$.

Definition 2 (Discounted value function and optimal policy) We define the $\gamma$
discounted weighted-average future value of (probabilistic) policy $\pi$ in environment $\rho$
given history $yx_{<k}$, or shorter, the $\rho$-value of $\pi$ given $yx_{<k}$, as

$$V^{\pi\rho}_{k\gamma}(yx_{<k}) := \frac{1}{\Gamma_k}\lim_{m\to\infty}\sum_{yx_{k:m}}(\gamma_k r_k+\ldots+\gamma_m r_m)\,\rho(yx_{<k}y\underline{x}_{k:m})\,\pi(yx_{<k}\underline{y}x_{k:m})$$

with $\Gamma_k := \sum_{i=k}^\infty \gamma_i$. The policy $p^\rho$ is defined as to maximize the
future value $V^{\pi\rho}_{k\gamma}$:

$$p^\rho := \arg\max_\pi V^{\pi\rho}_{k\gamma}, \qquad V^{*\rho}_{k\gamma} := V^{p^\rho\rho}_{k\gamma} = \max_\pi V^{\pi\rho}_{k\gamma} \geq V^{\pi\rho}_{k\gamma} \;\;\forall\pi.$$

5 A dynamic horizon like $m \rightsquigarrow m_k = k^2$ can lead to policies with very poor
performance [Hut00, Ch.4].

Remarks:

• $\pi(yx_{<k}\underline{y}x_{k:m})$ is actually independent of $x_m$, since $\pi$ is
chronological.

• Normalization of $V_{k\gamma}$ by $\Gamma_k$ does not affect the policy $p^\rho$.

• The definition of $p^\rho$ is independent of $k$.

• Without normalization by $\Gamma_k$ the future values would converge to zero for
$k\to\infty$ in every environment for every policy.

• For an mdp environment, a stationary policy, and geometric discounting
$\gamma_k \sim \gamma^k$, the future value is independent of $k$ and reduces to the well-known
mdp value function.

• There is always a deterministic optimizing policy $p^\rho$ (which we use).

• For a deterministic policy there is exactly one $y_{k:m}$ for each $x_{k:m}$ with
$\pi \neq 0$. The sum over $y_{k:m}$ drops in this case.

• An iterative representation as in Definition 1 is possible.

• Setting $\gamma_k = 1$ for $k \leq m$ and $\gamma_k = 0$ for $k > m$ gives back the
undiscounted model (2) with $V^{p\rho}_{k\gamma} = \frac{1}{m-k+1}V^{p\rho}_{km}$.

• $V_{k\gamma}$ (and $w_k$ defined below) depend on the realized history $yx_{<k}$.



Similarly to the previous sections one can prove the following properties: 



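Theorem 5 (Linearity and convexity of $V_\rho$ in $\rho$) $V^{\pi\rho}_{k\gamma}$ is a linear
function in $\rho$ and $V^{*\rho}_{k\gamma}$ is a convex function in $\rho$ in the sense that

$$V^{\pi\xi}_{k\gamma} = \sum_{\nu\in\mathcal{M}} w^\nu_k V^{\pi\nu}_{k\gamma} \quad\text{and}\quad V^{*\xi}_{k\gamma} \leq \sum_{\nu\in\mathcal{M}} w^\nu_k V^{*\nu}_{k\gamma}$$

$$\text{where}\quad \xi(yx_{<k}y\underline{x}_{k:m}) = \sum_{\nu\in\mathcal{M}} w^\nu_k\,\nu(yx_{<k}y\underline{x}_{k:m}) \quad\text{with}\quad w^\nu_k := w_\nu\frac{\nu(y\underline{x}_{<k})}{\xi(y\underline{x}_{<k})}.$$

The conditional representation of $\xi$ can be proven by dividing the definition (7) of
$\xi(y\underline{x}_{1:m})$ by $\xi(y\underline{x}_{<k})$ and by using Bayes' rule (1). The
posterior weight $w^\nu_k$ may be interpreted as the posterior belief in $\nu$ and is related to
learning aspects of policy $p^\xi$.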
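The posterior weight is the prior weight of an environment multiplied by the probability that environment assigns to the observed history, divided by the mixture probability of that history. A minimal sketch of this computation follows; the function names and the two-environment example in the usage note are hypothetical, and the probabilities of the history are assumed to be supplied by the caller.

```python
from typing import List


def posterior_weights(prior: List[float], history_probs: List[float]) -> List[float]:
    """Posterior weights: prior weight times history probability, divided by the mixture.

    prior[i] is the prior weight of environment i and history_probs[i] is the
    probability that environment i assigns to the observed perceptions given the
    actions. The mixture probability is assumed to be positive.
    """
    xi = sum(w * p for w, p in zip(prior, history_probs))
    return [w * p / xi for w, p in zip(prior, history_probs)]


def update(weights: List[float], next_probs: List[float]) -> List[float]:
    """One-cycle update using each environment's conditional probability of the new perception."""
    return posterior_weights(weights, next_probs)
```

As a usage example, take two environments that emit perception 1 with probability 0.9 and 0.5 respectively, a uniform prior, and two observed perceptions equal to 1. Then `posterior_weights([0.5, 0.5], [0.81, 0.25])` and two successive calls of `update` with `[0.9, 0.5]` give the same weights, which shift toward the first environment.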

Theorem 6 (Pareto optimality) For every $k$ and history $yx_{<k}$ the following holds:
$p^\xi$ is Pareto-optimal in the sense that there is no other policy $\pi$ with
$V^{\pi\nu}_{k\gamma} \geq V^{p^\xi\nu}_{k\gamma}$ for all $\nu\in\mathcal{M}$ and strict
inequality for at least one $\nu$.

Lemma 3 (Value difference relation)

$$0 \leq V^{*\nu}_{k\gamma} - V^{\tilde\pi_k\nu}_{k\gamma} =: \Delta^\nu_k \quad\Rightarrow\quad 0 \leq V^{*\mu}_{k\gamma} - V^{p^\xi\mu}_{k\gamma} \leq \tfrac{1}{w^\mu_k}\Delta_k \quad\text{with}\quad \Delta_k := \sum_{\nu\in\mathcal{M}} w^\nu_k\Delta^\nu_k$$

The proof of Theorem 6 and Lemma 3 follows the same steps as for Theorem 2 and Lemma 1
with appropriate replacements. The proof of the analogue of the convergence Theorem 4
involves one additional step. We abbreviate "with $\mu$ probability 1" by w.$\mu$.p.1.

Theorem 7 (Self-optimizing policy $p^\xi$ w.r.t. discounted value) For any $\mathcal{M}$, if
there exists a sequence of self-optimizing policies $\tilde\pi_k$, $k=1,2,3,...$ in the sense
that their expected weighted-average reward $V^{\tilde\pi_k\nu}_{k\gamma}$ converges for
$k\to\infty$ with $\nu$-probability one to the optimal value $V^{*\nu}_{k\gamma}$ for all
environments $\nu\in\mathcal{M}$, then this also holds for the universal policy $p^\xi$ in
the true $\mu$-environment, i.e.

If $\exists\tilde\pi_k \forall\nu: V^{\tilde\pi_k\nu}_{k\gamma} \to V^{*\nu}_{k\gamma}$ for $k\to\infty$ w.$\nu$.p.1 $\Rightarrow$ $V^{p^\xi\mu}_{k\gamma} \to V^{*\mu}_{k\gamma}$ for $k\to\infty$ w.$\mu$.p.1.

The probability qualifier refers to the historic perceptions $x_{<k}$. The historic actions
$y_{<k}$ are arbitrary.



The conclusion is valid for action histories $y_{<k}$ if the condition is satisfied for this
action history. Since we usually need the conclusion for the $p^\xi$-action history, which is
hard to characterize, we usually need to prove the condition for all action histories. Theorem
7 is a powerful result: An (inconsistent) sequence of probabilistic policies $\tilde\pi_k$
suffices to prove the existence of a (consistent) deterministic policy $p^\xi$. A result
similar to Theorem 4(i) also holds for the discounted case, roughly saying that
$V^{\tilde\pi_k\nu}_{k\gamma} - V^{*\nu}_{k\gamma} = O(\Delta(k))$ for all $\nu$ implies
$V^{p^\xi\mu}_{k\gamma} - V^{*\mu}_{k\gamma} = \frac{1}{\varepsilon}O(\Delta(k))$ with $\mu$
probability $1-\varepsilon$ for finite $\mathcal{M}$.

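Proof: We define
$\delta_\nu(k) := \Delta^\nu_k = V^{*\nu}_{k\gamma} - V^{\tilde\pi_k\nu}_{k\gamma}$. Since we
assumed bounded rewards $0 \leq r \leq r_{max}$ and $V^{\pi\rho}_{k\gamma}$ is a weighted
average of rewards we have

$$V^{*\nu}_{k\gamma} \leq r_{max} \quad\text{and}\quad V^{\tilde\pi_k\nu}_{k\gamma} \geq 0 \quad\Rightarrow\quad 0 \leq \delta_\nu(k) = \Delta^\nu_k \leq c := r_{max}.$$

The following inequalities follow from Lemma 3:

$$0 \leq V^{*\mu}_{k\gamma} - V^{p^\xi\mu}_{k\gamma} \leq \tfrac{1}{w^\mu_k}\Delta_k = \tfrac{1}{w^\mu_k}\delta(k) \qquad (10)$$

The premise in Theorem 7 is that
$\delta_\nu(k) = V^{*\nu}_{k\gamma} - V^{\tilde\pi_k\nu}_{k\gamma} \to 0$ for $k\to\infty$ which
implies $\delta(k) \to 0$ (w.$\mu$.p.1) by Lemma 2(ii). What is new and what remains to be shown
is that $w^\mu_k$ is bounded from below in order to have convergence of (10) to zero. We show
that
$z_{k-1} := \frac{w_\mu}{w^\mu_k} = \frac{\xi(y\underline{x}_{<k})}{\mu(y\underline{x}_{<k})} \geq 0$
converges to a finite value, which completes the proof. Let $E$ denote the $\mu$ expectation.
Then

$$E[z_k|x_{<k}] = {\sum_{x_k}}' \mu(yx_{<k}y\underline{x}_k)\frac{\xi(y\underline{x}_{1:k})}{\mu(y\underline{x}_{1:k})} = {\sum_{x_k}}' \frac{\xi(y\underline{x}_{1:k})}{\mu(y\underline{x}_{<k})} \leq \frac{\xi(y\underline{x}_{<k})}{\mu(y\underline{x}_{<k})} = z_{k-1}$$

${\sum_{x_k}}'$ runs over all $x_k$ with $\mu(y\underline{x}_{1:k}) \neq 0$. The first equality
holds w.$\mu$.p.1. In the second equality we have used Bayes' rule twice.
$E[z_k|x_{<k}] \leq z_{k-1}$ shows that $-z_k$ is a semi-martingale. Since $-z_k$ is
non-positive, [Doo53, Th.4.1s(i), p.324] implies that $-z_k$ converges for $k\to\infty$ to a
finite value w.$\mu$.p.1. □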
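The key step of the proof, that the conditional expectation of the ratio of mixture probability to true probability never exceeds its previous value, can be checked exactly on a toy class. The sketch below makes strong simplifying assumptions that are not part of this work: environments ignore actions and emit i.i.d. binary perceptions with a fixed probability `theta` of emitting 1, and the particular class and weights used in the usage note are hypothetical.

```python
from typing import Sequence, Tuple


def seq_prob(theta: float, xs: Tuple[int, ...]) -> float:
    """Probability of the binary sequence xs under an i.i.d. environment with P(x=1)=theta."""
    p = 1.0
    for x in xs:
        p *= theta if x == 1 else 1.0 - theta
    return p


def ratio(thetas: Sequence[float], weights: Sequence[float],
          mu_index: int, xs: Tuple[int, ...]) -> float:
    """z = mixture probability of xs divided by its probability under the true environment."""
    xi = sum(w * seq_prob(t, xs) for w, t in zip(weights, thetas))
    return xi / seq_prob(thetas[mu_index], xs)


def expected_next_ratio(thetas: Sequence[float], weights: Sequence[float],
                        mu_index: int, xs: Tuple[int, ...]) -> float:
    """Expectation of z after one more perception, taken under the true environment."""
    mu = thetas[mu_index]
    total = 0.0
    for x in (0, 1):
        p_x = mu if x == 1 else 1.0 - mu
        if p_x > 0.0:  # the primed sum skips perceptions of probability zero
            total += p_x * ratio(thetas, weights, mu_index, xs + (x,))
    return total
```

With `thetas = (1.0, 0.5)`, uniform weights and the first environment as the true one, the ratio starts at 1 for the empty history and its expectation after one cycle drops to 0.75, because the mixture spends probability on a perception the true environment never produces. If the true environment gives positive probability to every perception, the expectation equals the previous ratio.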



6 Markov Decision Processes 

From all possible environments, Markov (decision) processes are probably the most
intensively studied ones. To give an example, we apply Theorems 4 and 7 to ergodic Markov
decision processes, but we will be very brief.

Definition 3 (Ergodic Markov Decision Processes) We call $\mu$ a (stationary) Markov
Decision Process (mdp) if the probability of observing $x_k\in\mathcal{X}$, given history
$yx_{<k}y_k$ does only depend on the last action $y_k\in\mathcal{Y}$ and the last observation
$x_{k-1}$, i.e. if $\mu(yx_{<k}y\underline{x}_k) = \mu(yx_{k-1}\underline{x}_k)$. In this case
$x_k$ is called a state, $\mathcal{X}$ the state space, and $\mu(yx_{k-1}\underline{x}_k)$ the
transition matrix. An mdp $\mu$ is called ergodic if there exists a policy under which every
state is visited infinitely often with probability 1. Let $\mathcal{M}_{MDP}$ be the set of
mdps and $\mathcal{M}_{MDP1}$ be the set of ergodic mdps. If an mdp
$\mu(yx_{k-1}\underline{x}_k)$ is independent of the action $y_{k-1}$ it is a Markov process,
if it is independent of the last observation $x_{k-1}$ it is an i.i.d. process.

Stationary mdps $\mu$ have stationary optimal policies $p^\mu$ mapping the same state /
observation $x_t$ always to the same action $y_t$. On the other hand a mixture $\xi$ of mdps is
itself not an mdp, i.e. $\xi\notin\mathcal{M}_{MDP}$, which implies that $p^\xi$ is, in
general, not a stationary policy. The definition of ergodicity given here is least demanding,
since it only demands the existence of a single policy under which the Markov process is
ergodic. Often, stronger assumptions, e.g. that every policy is ergodic or that a stationary
distribution exists, are made. We now show that there are self-optimizing policies for the
class of ergodic mdps in the following sense.

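Theorem 8 (Self-optimizing policies for ergodic mdps) There exist self-optimizing
policies $\tilde p_m$ for the class of ergodic mdps in the sense that

i) $\exists\tilde p_m \forall\nu\in\mathcal{M}_{MDP1}: \frac{1}{m}V^{*\nu}_{1m} - \frac{1}{m}V^{\tilde p_m\nu}_{1m} \leq c_\nu m^{-1/3} \to 0$ for $m\to\infty$,

where $c_\nu$ are some constants. In the discounted case, if the discount sequence $\gamma_k$
has unbounded effective horizon $h^{eff}_k \to \infty$ for $k\to\infty$, then there exist
self-optimizing policies $\tilde\pi_k$ for the class of ergodic mdps in the sense that

ii) $\exists\tilde\pi_k \forall\nu\in\mathcal{M}_{MDP1}: V^{\tilde\pi_k\nu}_{k\gamma} \to V^{*\nu}_{k\gamma}$ for $k\to\infty$ if $\frac{\gamma_{k+1}}{\gamma_k} \to 1$.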



There is much literature on constructing and analyzing self-optimizing learning algorithms
in mdp environments. The assumptions on the structure of the mdps vary, all include some
form of ergodicity, often stronger than Definition 3, demanding that the Markov process is
ergodic under every policy. See, for instance, [KV86, Ber95]. We will only briefly outline
one algorithm satisfying Theorem 8 without trying to optimize performance.

Proof idea: For (i) one can choose a policy $\tilde p_m$ which performs (uniformly) random
actions in cycles $1...k_0-1$ with $1 \ll k_0 \ll m$ and which follows thereafter the optimal
policy based on an estimate $\hat T$ of the transition matrix
$T^a_{ss'} \equiv \nu(as\underline{s'})$ from the initial $k_0-1$ cycles. The existence of an
ergodic policy implies that for every pair of states $s_{start}, s\in\mathcal{X}$ there is a
sequence of actions and transitions of length at most $|\mathcal{X}|-1$ such that state $s$ is
reached from state $s_{start}$. The probability that the "right" transition occurs is at least
$T_{min}$ with $T_{min}$ being the smallest non-zero transition probability in $T$. The
probability that a random action is the "right" action is at least $|\mathcal{Y}|^{-1}$. So the
probability of reaching a state $s$ in $|\mathcal{X}|-1$ cycles via a random policy is at least
$(T_{min}/|\mathcal{Y}|)^{|\mathcal{X}|-1}$. In state $s$ action $a$ is taken with probability
$|\mathcal{Y}|^{-1}$ and leads to state $s'$ with probability $T^a_{ss'} \geq T_{min}$. Hence,
the expected number of transitions $s \stackrel{a}{\to} s'$ to occur in the first $k_0$ cycles
is $\geq \frac{k_0}{|\mathcal{X}|}(T_{min}/|\mathcal{Y}|)^{|\mathcal{X}|} \sim k_0$. The
accuracy of the frequency estimate $\hat T^a_{ss'}$ of $T^a_{ss'}$ hence is
$\sim k_0^{-1/2}$. Similar mdps lead to "similar" optimal policies, which lead to similar
values. More precisely, one can show that $\hat T - T \sim k_0^{-1/2}$ implies the same
accuracy in the average value, i.e.
$|\frac{1}{m}V^{\tilde p_m\nu}_{k_0m} - \frac{1}{m}V^{*\nu}_{k_0m}| \sim k_0^{-1/2}$, where
$\tilde p_m$ is the optimal policy based on $\hat T$ and $*$ is the optimal policy based on
$T(=\nu)$. Since $\frac{1}{m}V^{\nu}_{1k_0} \sim k_0/m$, (i) follows (with probability 1) by
setting $k_0 \sim m^{2/3}$. The policy $\tilde p_m$ can be derandomized, showing (i) for sure.
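The exploration phase of this construction, uniformly random actions followed by a frequency estimate of the transition matrix, can be sketched as follows. This is an illustration under stated assumptions rather than the algorithm analyzed in the literature: the simulator interface `step(state, action, rng)`, the helper names, and the deterministic two-state `toy_step` are hypothetical, and the exploitation phase (planning with the estimate) is omitted.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, Tuple


def exploration_length(m: int) -> int:
    """Number of exploration cycles, chosen proportional to m^(2/3)."""
    return max(1, round(m ** (2.0 / 3.0)))


def explore_and_estimate(
    step: Callable[[int, int, random.Random], int],
    n_actions: int,
    start_state: int,
    k0: int,
    rng: random.Random,
) -> Dict[Tuple[int, int], Dict[int, float]]:
    """Take k0 uniformly random actions and return relative transition frequencies.

    The result maps (state, action) to {next_state: estimated probability} for
    every state-action pair that was actually visited.
    """
    counts: Dict[Tuple[int, int], Dict[int, int]] = defaultdict(lambda: defaultdict(int))
    s = start_state
    for _ in range(k0):
        a = rng.randrange(n_actions)
        s_next = step(s, a, rng)
        counts[(s, a)][s_next] += 1
        s = s_next
    return {
        sa: {s2: c / sum(nxt.values()) for s2, c in nxt.items()}
        for sa, nxt in counts.items()
    }


def toy_step(s: int, a: int, rng: random.Random) -> int:
    """Hypothetical deterministic two-state mdp: action 1 flips the state, action 0 keeps it."""
    return (s + a) % 2
```

A call such as `explore_and_estimate(toy_step, 2, 0, exploration_length(1000), random.Random(0))` returns estimated transition probabilities for the visited state-action pairs. In a stochastic mdp the estimates would carry sampling error that shrinks with the square root of the number of visits, which is the trade-off behind the choice of the exploration length.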

The discounted case (ii) can be proven similarly. The history $yx_{<k}$ is simply ignored and the analogue to $m\to\infty$ is $h_k^{eff}\to\infty$ for $k\to\infty$, which is ensured by $\frac{\gamma_{k+1}}{\gamma_k}\to 1$. Let $\tilde\pi_k$ be the policy which performs (uniformly) random actions in cycles $k...k_0-1$ with $1 \ll k_0-k \ll h_k^{eff}$ and which follows thereafter the optimal policy$^7$ based on an estimate $\hat T$ of the transition matrix $T$ from cycles $k...k_0-1$. The existence of an ergodic policy, again, ensures that the expected number of transitions $s \xrightarrow{a} s'$ occurring in cycles $k...k_0-1$ is proportional to $\Delta := k_0-k$. The accuracy of the frequency estimate $\hat T$ of $T$ is $\sim\Delta^{-1/2}$, which implies

$$V^{\tilde\pi_k\nu}_{k_0\gamma} \to V^{*\nu}_{k_0\gamma} \quad\text{for}\quad \Delta = k_0-k \to \infty, \qquad (11)$$

where $\tilde\pi_k$ is the optimal policy based on $\hat T$ and $*$ is the optimal policy based on $T(=\nu)$. It remains to show that the achieved reward in the random phase $k...k_0-1$ gives a negligible contribution to $V_{k\gamma}$. The following implications for $k\to\infty$ are easy to show:

$$\frac{\gamma_{k+1}}{\gamma_k}\to 1 \;\Rightarrow\; \frac{\gamma_{k+\Delta}}{\gamma_k}\to 1 \;\Rightarrow\; \frac{\Gamma_{k+\Delta}}{\Gamma_k}\to 1 \;\Rightarrow\; \frac{1}{\Gamma_k}\sum_{i=k}^{k+\Delta-1}\gamma_i r_i \leq \frac{r_{max}}{\Gamma_k}[\Gamma_k-\Gamma_{k+\Delta}] \to 0.$$

Since convergence to zero is true for all fixed finite $\Delta$ it is also true for sufficiently slowly increasing $\Delta(k)\to\infty$. This shows that the contribution of the first $\Delta$ rewards $r_k+...+r_{k_0-1}$ to $V_{k\gamma}$ is negligible. Together with (11) this shows $V^{\tilde\pi_k\nu}_{k\gamma} \to V^{*\nu}_{k\gamma}$ for $k_0 := k+\Delta(k)$. $\Box$
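The step from the discount ratios to the tail sums in the chain of implications is stated as easy to show. One possible way to see it is sketched here, under the assumptions (not restated in this section) that $\gamma_i \geq 0$, that $\Gamma_k := \sum_{i\geq k}\gamma_i$ is finite and positive, and that rewards lie in $[0, r_{max}]$:

```latex
% Sketch only, not taken from the original proof.
% Assumptions: \gamma_i \ge 0, 0 < \Gamma_k := \sum_{i \ge k} \gamma_i < \infty,
% rewards 0 \le r_i \le r_{max}, and \Delta fixed.
\begin{align*}
&\text{Fix } \varepsilon > 0 \text{ and choose } k \text{ so large that }
  \gamma_{i+\Delta} \ge (1-\varepsilon)\,\gamma_i \text{ for all } i \ge k. \\
&\Gamma_{k+\Delta} = \sum_{i \ge k} \gamma_{i+\Delta}
  \;\ge\; (1-\varepsilon) \sum_{i \ge k} \gamma_i = (1-\varepsilon)\,\Gamma_k, \\
&0 \;\le\; \frac{\Gamma_k - \Gamma_{k+\Delta}}{\Gamma_k} \;\le\; \varepsilon
  \quad\Longrightarrow\quad
  \frac{1}{\Gamma_k}\sum_{i=k}^{k+\Delta-1} \gamma_i r_i
  \;\le\; r_{max}\,\frac{\Gamma_k - \Gamma_{k+\Delta}}{\Gamma_k}
  \;\le\; r_{max}\,\varepsilon .
\end{align*}
```

Since $\varepsilon$ was arbitrary, the normalized contribution of the $\Delta$ rewards of the random phase tends to zero for $k\to\infty$.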

The conditions $\Gamma_k < \infty$ and $\frac{\gamma_{k+1}}{\gamma_k}\to 1$ on the discount sequence are, for instance, satisfied for $\gamma_k = 1/k^2$, so the Theorem is not vacuous. The popular geometric discount $\gamma_k = \gamma^k$ fails the latter condition; it has finite effective horizon. [Hut00] gives a detailed account of discount and horizon issues, and motivates $h_k^{eff}\to\infty$ philosophically.
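The contrast between the two discount sequences can be checked numerically. The sketch below makes two assumptions that are not spelled out in this section: the infinite tail sum $\Gamma_k = \sum_{i\geq k}\gamma_i$ is truncated at a large index `n_max`, and the effective horizon is taken to be the smallest $h$ with $\Gamma_{k+h} \leq \frac{1}{2}\Gamma_k$. The function names are hypothetical.

```python
def tail_sums(gamma, n_max):
    """Gamma[k] = sum_{i=k}^{n_max} gamma(i), a truncation of the infinite tail sum."""
    Gamma = [0.0] * (n_max + 2)
    for i in range(n_max, 0, -1):
        Gamma[i] = Gamma[i + 1] + gamma(i)
    return Gamma

def effective_horizon(Gamma, k):
    """Smallest h with Gamma[k+h] <= Gamma[k] / 2 (assumed definition of h_k^eff)."""
    h = 0
    while Gamma[k + h] > 0.5 * Gamma[k]:
        h += 1
    return h

def quadratic(k):
    """Discount gamma_k = 1/k^2: the ratio gamma_{k+1}/gamma_k tends to 1."""
    return 1.0 / k ** 2

def geometric(k, base=0.9):
    """Discount gamma_k = base^k: the ratio gamma_{k+1}/gamma_k stays at base < 1."""
    return base ** k

if __name__ == "__main__":
    n_max = 100_000
    Gq = tail_sums(quadratic, n_max)
    Gg = tail_sums(geometric, n_max)
    for k in (10, 100, 1000):
        print(k, effective_horizon(Gq, k), effective_horizon(Gg, k))
```

For the quadratic discount the computed horizon grows with $k$, while for the geometric discount it stays constant, in line with the remark that the geometric discount has finite effective horizon.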

Together with Theorems ^ and [5], Theorem [8] immediately implies that policy $p^\xi$ is self-optimizing for the class of ergodic MDPs.



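Corollary 1 (Policy $p^\xi$ is self-optimizing for ergodic MDPs) If $\mathcal M$ is a finite or countable class of ergodic MDPs, and $\xi() := \sum_{\nu\in\mathcal M} w_\nu \nu()$, then policies $p^\xi_m$ maximizing $V^{p\xi}_{1m}$ and $p^\xi$ maximizing $V^{p\xi}_{k\gamma}$ are self-optimizing in the sense that

$$\forall\nu\in\mathcal M: \quad \frac{1}{m}V^{p^\xi_m\nu}_{1m} \xrightarrow{m\to\infty} \frac{1}{m}V^{*\nu}_{1m} \quad\text{and}\quad V^{p^\xi\nu}_{k\gamma} \xrightarrow{k\to\infty} V^{*\nu}_{k\gamma} \quad\text{if}\quad \frac{\gamma_{k+1}}{\gamma_k}\to 1.$$

If $\mathcal M$ is finite, then the speed of the first convergence is at least $O(m^{-1/3})$.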
6 For $T^a_{ss'} = 0$ the estimate $\hat T^a_{ss'} = 0$ is exact.

7 For non-geometric discounts as here, optimal policies are, in general, not stationary. 
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7 Conclusions 



Summary: We studied agents acting in general probabilistic environments with reinforcement feedback. We only assumed that the true environment $\mu$ belongs to a known class of environments $\mathcal M$, but is otherwise unknown. We showed that the Bayes-optimal policy $p^\xi$ based on the Bayes-mixture $\xi = \sum_{\nu\in\mathcal M} w_\nu \nu$ is Pareto-optimal, and self-optimizing if $\mathcal M$ admits self-optimizing policies. The class of ergodic MDPs admits self-optimizing policies w.r.t. the average value, and w.r.t. the discounted value if the effective horizon grows indefinitely.



Continuous classes $\mathcal M$: There are uncountably many (ergodic) MDPs. Since we have restricted our development to countable classes $\mathcal M$ we had to give the Corollary for a countable subset of $\mathcal M_{MDP1}$. We may choose $\mathcal M$ as the set of all ergodic MDPs with rational (or computable) transition probabilities. In this case $\mathcal M$ is a dense subset of $\mathcal M_{MDP1}$ which is, from a practical point of view, sufficiently rich. On the other hand, it is possible to extend the theory to continuously parameterized families of environments $\mu_\theta$ and $\xi = \int w_\theta \mu_\theta\, d\theta$. Under some mild (differentiability and existence) conditions, most results of this work remain valid in some form, especially Corollary [1] for all ergodic MDPs.



Bayesian self-optimizing policy: Policy $p^\xi$ with unbounded effective horizon is the first purely Bayesian self-optimizing consistent policy for ergodic MDPs. The policies of all previous approaches were either hand crafted, like the ones in the proof of Theorem [5], or were Bayesian with a pre-chosen horizon $m$, or with geometric discounting $\gamma$ with finite effective horizon (which does not allow self-optimizing policies) [KV86, Ber95]. The combined conditions $\Gamma_k < \infty$ and $\frac{\gamma_{k+1}}{\gamma_k}\to 1$ allow a consistent self-optimizing Bayes-optimal policy based on mixtures.



Bandits: Bandits are a special subclass of ergodic MDPs. In a two-armed bandit problem you repeatedly pull one of two levers, resulting in a gain of A$1 with probability $p_i$ for arm number $i$. The game can be described as an MDP with parameters $p_i$. If the $p_i$ are unknown, Corollary [1] shows that policy $p^\xi$ yields asymptotically optimal payoff. The discounted unbounded horizon approach and result is, to the best of our knowledge, even new when restricted to Bandits.
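For a small finite class of bandits, a Bayes-optimal mixture policy can be computed by brute force. The sketch below is illustrative only and uses the finite-horizon total reward rather than the discounted value discussed above. The function name `bayes_optimal_bandit`, the example class and the prior are hypothetical, and all $p_i$ are assumed to lie strictly between 0 and 1 so that posteriors are always defined.

```python
from functools import lru_cache

def bayes_optimal_bandit(envs, prior, horizon):
    """Return (value, first_arm) of a policy maximizing xi-expected total reward.

    envs    -- list of (p0, p1) pairs, the win probabilities of the two arms
    prior   -- prior weights w_nu of the environments, summing to 1
    horizon -- number of pulls m
    Assumption: 0 < p < 1 for every arm, so no history has zero probability.
    """
    def posterior(counts):
        # counts = (wins0, losses0, wins1, losses1); posterior weights ~ w_nu * nu(history)
        w0, l0, w1, l1 = counts
        like = [w * p[0] ** w0 * (1 - p[0]) ** l0 * p[1] ** w1 * (1 - p[1]) ** l1
                for p, w in zip(envs, prior)]
        z = sum(like)
        return [x / z for x in like]

    @lru_cache(maxsize=None)
    def value(counts, steps_left):
        if steps_left == 0:
            return 0.0, None
        post = posterior(counts)
        best = (-1.0, None)
        for arm in (0, 1):
            # xi-probability of a win on this arm given the history
            p_win = sum(w * p[arm] for w, p in zip(post, envs))
            win = list(counts)
            win[2 * arm] += 1
            lose = list(counts)
            lose[2 * arm + 1] += 1
            v = (p_win * (1.0 + value(tuple(win), steps_left - 1)[0])
                 + (1.0 - p_win) * value(tuple(lose), steps_left - 1)[0])
            if v > best[0]:
                best = (v, arm)
        return best

    return value((0, 0, 0, 0), horizon)

if __name__ == "__main__":
    # Two candidate environments: either arm 0 or arm 1 is the good one.
    print(bayes_optimal_bandit([(0.9, 0.1), (0.1, 0.9)], [0.5, 0.5], 10))
```

The recursion conditions only on the win and loss counts per arm, since for bandits the posterior weights depend on the history only through these counts.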



Other environmental classes: Bandits, i.i.d. processes, classification tasks, and many more are all special (degenerate) cases of ergodic MDPs, for which Corollary [1] shows that $p^\xi$ is self-optimizing. But the existence of self-optimizing policies is not limited to (subclasses of ergodic) MDPs. Certain classes of POMDPs, $k$-th order ergodic MDPs, factorizable environments, repeated games, and prediction problems are not MDPs, but nevertheless admit self-optimizing policies (to be shown elsewhere), and hence the corresponding Bayes-optimal mixture policy $p^\xi$ is self-optimizing by Theorems |] and [7].
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Outlook: 



Future research could be the derivation of non-asymptotic bounds, possibly along the lines of [Hut01]. To get good bounds one may have to exploit extra properties of the environments, like the mixing rate of MDPs [KS98]. Another possibility is to search for other performance criteria along the lines of [Hut00, Ch.6], especially for the universal prior [Sol78] and for the Speed prior [Sch02]. Finally, instead of convergence of the expected reward sum, studying convergence with high probability of the actual reward sum would be interesting.
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